Jinho D. Choi

Tinker Tales: Supporting Child-AI Collaboration through Co-Creative Storytelling with Educational Scaffolding

Feb 04, 2026

Nayoung Choi, Jiseung Hong, Peace Cyebukayire, Ikseon Choi, Jinho D. Choi

Abstract:Artificial intelligence (AI) is increasingly framed as a collaborative partner in creative activities, yet children's interactions with AI have largely been studied in AI-led instructional settings rather than co-creative collaboration. This leaves open questions about how children can meaningfully engage with AI through iterative co-creation. We present Tinker Tales, a tangible storytelling system designed with narrative and social-emotional scaffolding to support child-AI collaboration. The system combines a physical storytelling board, NFC-embedded toys representing story elements (e.g., characters, places, items, and emotions), and a mobile app that mediates child-AI interaction. Children shape and refine stories by placing and moving story elements and interacting with the AI through tangible and voice-based interaction. We conducted an exploratory user study with 10 children to examine how they interacted with Tinker Tales. Our findings show that children treated the AI as an attentive, responsive collaborator, while scaffolding supported coherent narrative refinement without diminishing children's agency.

RIFT: Reordered Instruction Following Testbed To Evaluate Instruction Following in Singular Multistep Prompt Structures

Jan 26, 2026

Andrew Jaffe, Noah Reicin, Jinho D. Choi

Abstract:Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to maintain flow of instructions remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT, Reordered Instruction Following Testbed, to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question-answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non-sequential traversal. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions (compared to baseline), revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results reveal structural sensitivity as a fundamental limitation in current architectures, with direct implications for applications requiring non-sequential control flow such as workflow automation and multi-agent systems.

* 13 pages, 5 figures, submitted to ACL ARR

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Jan 12, 2026

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

Abstract:Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segments and retrieves relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs' expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.

* Accepted (B) to TACL 2026
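
The abstract does not spell out DyCP's algorithm, so the following is only a toy illustration of the general idea of query-time segmentation and retrieval, not the authors' method. It assumes a simple word-overlap (Jaccard) relevance score and a fixed threshold; the names `relevance`, `prune_context`, and `threshold` are hypothetical. Consecutive relevant turns are grouped into segments, and the original dialogue order is preserved.

```python
import re


def tokens(text):
    """Lowercased word set; a stand-in for a real relevance model (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(turn, query):
    """Jaccard overlap between a dialogue turn and the current query."""
    a, b = tokens(turn), tokens(query)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def prune_context(turns, query, threshold=0.1):
    """Group consecutive relevant turns into segments, keeping dialogue order.

    No topic boundaries are defined in advance: a segment simply ends
    when a turn falls below the (hypothetical) relevance threshold.
    """
    segments, current = [], []
    for turn in turns:
        if relevance(turn, query) >= threshold:
            current.append(turn)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


turns = [
    "I adopted a dog named Rex last year.",
    "Rex loves the dog park.",
    "My flight to Paris was delayed.",
    "The hotel was nice.",
    "Rex hates baths.",
]
query = "Which dog park does Rex like?"
segments = prune_context(turns, query)
for segment in segments:
    print(segment)
```

In this sketch the two travel turns are dropped and the remaining turns form two segments. A practical system would likely replace the word-overlap score with an embedding-based or learned relevance measure.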

LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Nov 17, 2025

Grace Byun, Swati Rajwal, Jinho D. Choi

Abstract:Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
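
The two agreement measures named in the abstract, correlation with human graders and exact score agreement, can be sketched as follows. This is a minimal illustration that assumes Pearson correlation (the abstract does not say which coefficient was used) and uses made-up scores; it is not the authors' released code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items where both graders gave the identical score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty, equal-length lists")
    return sum(1 for x, y in zip(xs, ys) if x == y) / len(xs)


# Hypothetical quiz scores for six answers (illustrative data only).
human = [5, 3, 4, 2, 5, 1]
llm = [5, 3, 3, 2, 4, 1]

print(f"correlation:     {pearson(human, llm):.3f}")
print(f"exact agreement: {exact_agreement(human, llm):.1%}")
```

The two numbers answer different questions: correlation can be high even when the LLM is consistently off by a point, whereas exact agreement counts only identical scores.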

Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

Jun 12, 2025

Peilin Wu, Jinho D. Choi

Abstract:We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and speaker similarity, our approach utilizes the dialogue transcript alone. Two models are developed: the Single Prediction Model (SPM) and the Multiple Prediction Model (MPM), both of which demonstrate significant improvements in identifying speaker changes, particularly in short conversations. Our findings, based on a curated dataset encompassing diverse conversational scenarios, reveal that the text-based SD approach, especially the MPM, performs competitively against state-of-the-art audio-based SD systems, with superior performance in short conversational contexts. This paper not only showcases the potential of leveraging linguistic features for SD but also highlights the importance of integrating semantic understanding into SD systems, opening avenues for future research in multimodal and semantic feature-based diarization.

TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

Apr 30, 2025

Sichang Tu, Abigail Powers, Stephen Doogan, Jinho D. Choi

Abstract:Objectives: While Large Language Models (LLMs) have been widely used to assist clinicians and support patients, no existing work has explored dialogue systems for standard diagnostic interviews and assessments. This study aims to bridge the gap in mental healthcare accessibility by developing an LLM-powered dialogue system that replicates clinician behavior. Materials and Methods: We introduce TRUST, a framework of cooperative LLM modules capable of conducting formal diagnostic interviews and assessments for Post-Traumatic Stress Disorder (PTSD). To guide the generation of appropriate clinical responses, we propose a Dialogue Acts schema specifically designed for clinical interviews. Additionally, we develop a patient simulation approach based on real-life interview transcripts to replace time-consuming and costly manual testing by clinicians. Results: A comprehensive set of evaluation metrics is designed to assess the dialogue system from both the agent and patient simulation perspectives. Expert evaluations by conversation and clinical specialists show that TRUST performs comparably to real-life clinical interviews. Discussion: Our system performs at the level of average clinicians, with room for future enhancements in communication styles and response appropriateness. Conclusions: Our TRUST framework shows its potential to facilitate mental healthcare availability.

* 5 figures, 4 tables

Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions

Apr 25, 2025

James D. Finch, Yasasvi Josyula, Jinho D. Choi

Abstract:In task-oriented dialogue (TOD) systems, Slot Schema Induction (SSI) is essential for automatically identifying key information slots from dialogue data without manual intervention. This paper presents a novel state-of-the-art (SoTA) approach that formulates SSI as a text generation task, where a language model incrementally constructs and refines a slot schema over a stream of dialogue data. To develop this approach, we present a fully automatic LLM-based TOD simulation method that creates data with high-quality state labels for novel task domains. Furthermore, we identify issues in SSI evaluation due to data leakage and poor metric alignment with human judgment. We resolve these by creating new evaluation data using our simulation method with human guidance and correction, as well as designing improved evaluation metrics. These contributions establish a foundation for future SSI research and advance the SoTA in dialogue understanding and system development.

* Accepted (B) to TACL 2025

D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model

Apr 18, 2025

Grace Byun, Jinho D. Choi

Abstract:Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: (1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and (2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman's rho 0.99, Kendall's tau 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.
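
The two evaluation ideas in the abstract can be illustrated generically. The sketch below assumes Shannon entropy over a model's answer-choice probabilities and Spearman's rho over model accuracies with no tied values; the exact procedure in the paper may differ, and all numbers are made up for illustration.

```python
from math import log2


def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution over choices."""
    return -sum(p * log2(p) for p in probs if p > 0)


def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation for two tie-free, equal-length lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# A model that is unsure among four choices versus one that is certain.
uniform = entropy([0.25, 0.25, 0.25, 0.25])
certain = entropy([1.0, 0.0, 0.0, 0.0])

# Hypothetical accuracies of four models on ground-truth vs generated distractors.
acc_ground_truth = [0.81, 0.64, 0.72, 0.55]
acc_generated = [0.78, 0.69, 0.66, 0.50]
rho = spearman_rho(acc_ground_truth, acc_generated)
print(uniform, certain, rho)
```

A rho near 1 means the generated distractors rank the models in nearly the same order as the ground-truth distractors, which is the sense of "ranking alignment" illustrated here.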

Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering

Apr 18, 2025

Grace Byun, Shinsun Lee, Nayoung Choi, Jinho D. Choi

Abstract:Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.

Tinker Tales: Interactive Storytelling Framework for Early Childhood Narrative Development and AI Literacy

Apr 17, 2025

Nayoung Choi, Peace Cyebukayire, Jinho D. Choi

Abstract:This paper presents Tinker Tales, an interactive storytelling framework in the format of a board game, designed to support both narrative development and AI literacy in early childhood. The framework integrates tangible and speech-based interactions with AI through NFC chip-attached pawns and tokens, along with a speaker and microphone. Children select and define key story elements-such as characters, places, items, and emotions-using the pawns and tokens, providing further details to the AI and receiving proper assistance, similar to how adults prompt AI for specific tasks (e.g., writing). For evaluation, several game sessions were simulated with a child AI agent, and the quality and safety of the generated stories were assessed from various perspectives. This work highlights the potential of combining physical and digital elements in AI literacy, offering a safe and engaging way for children to learn how to effectively collaborate with AI.
